Localizing and recognizing arbitrarily oriented text in natural scene images 
is a major challenge because scene text is often irregular in shape. 
This paper presents a simple and effective graph-representation algorithm 
for detecting the location of arbitrarily oriented text, which in turn eases 
the text recognition process. An 
arbitrarily oriented text can be horizontal, vertical, perspective, curved 
(diagonal/off-diagonal), or even a combination of these. As a pre-processing step, 
image enhancement is performed in the frequency domain to obtain an image 
representation that is invariant to intensity. To extract text regions, 
bounding boxes must be drawn for each candidate character in the scene 
image. This step is carried out with the 
region-based approach called maximally stable extremal regions. A typical 
problem with curved text localization is that non-text objects may occur 
within localized text regions. To our knowledge, our method is the first in the 
literature that searches for dominating sets to solve this problem. This dominating set 
method outperforms several existing methods, including deep learning 
methods used for arbitrary text localization, on challenging datasets such as the 13th 
international conference on document analysis and recognition (ICDAR 
2015), multi-script robust reading competition (MRRC), CurvedText 80 
(CUTE80), and arbitrary text (ArT). 
1. INTRODUCTION 


Text present in images falls into two categories: superimposed text and scene text. If text is 
rendered artificially during the production of an image, it is called superimposed text. If it is part of the 
captured scene, it is scene text. Localizing and recognizing scene text is more complex than for 
superimposed text because the text is part of the scene itself. When scene text has an arbitrary shape, the task 
becomes more challenging than for regular scene text. This problem has attracted the attention of researchers in 
computer vision, pattern recognition, and artificial intelligence owing to its variety of applications, such as 
image indexing, language translation, automatic driving (navigation reading), reading assistance, and many 
more. The scene text reading process has two steps: the first is text extraction, which means separating text 
from the background, and the second is text recognition, which means making the computer recognize the 
text. The initial step of scene text reading is to find the location of text regions. Usually, text regions are 
erratic. From observations made on arbitrary text, all text shapes are categorized mainly into 
perspective text and non-perspective text. Perspective text is classified as left perspective or right 
perspective. Non-perspective text is classified as vertical, horizontal, diagonal, or off-diagonal, where 
diagonal and off-diagonal shapes can be straight-lined or curved. Text of any orientation found in a scene 
image should fall under one of the categories mentioned above. In Figure 1, Figure 1(a) shows the input 
image, while Figures 1(b) and 1(c) show the outputs of the existing system and the proposed system, respectively. 


[Figure 1 panel annotations: (b) non-text object within localized region; (c) only text objects within localized region] 


Figure 1. Arbitrary shaped text detection challenges (a) original image, (b) output of traditional MSER with 
bounding box (red box) [1], suffers from non-text object, and (c) output of DS based method (red box) 
includes only text objects 


In the literature, scene text reading methods can be split into two groups. The first uses 
deep learning to detect text blocks instead of individual characters and then groups them into text 
lines [2]-[11], [12]-[15]. The second is the traditional approach, which localizes individual characters first and 
then groups them into words; these words are then extracted from the background [16]-[22]. Most authors use 
deep neural network (DNN) methods for arbitrarily oriented text, and very few have contributed 
to the traditional approach. Zhu and Du [15] consider the degree of blur and the gradient value of points to 
distinguish between text and non-text regions, with k-means clustering used to separate them. The text region 
is then divided into four sub-regions. Finally, symmetry features are obtained with the Bhattacharyya distance 
measure to avoid misclassification, and these features are input to a support vector machine (SVM) 
classifier. Methods [16], [17] apply the Laplacian of Gaussian (LOG) to find fully connected components, 
which yields true positive text candidates by removing false positives. Basavaraju et al. [19] used a Gaussian 
low-pass filter and the 2D discrete wavelet transform (DWT) for better feature extraction. Candidate 
text was then extracted using k-means clustering. LOG is applied to correct disconnected text edges, and 
the Euler number is then computed to retain true text. Finally, a Gaussian mixture 
model (GMM) was used for segmentation. The pixel-based method [19] generates edges around the arbitrary 
text using standard deviation, and the text line is determined using the double line structure method. 

The maximally stable extremal region (MSER) method has been used for multi-oriented text to extract 
stable regions, and it is then combined with Canny edge detection to enhance text edges. Geometric properties 
of the text and the stroke width transform (SWT) are used to filter out non-text regions from the candidate 
text regions [20]. However, this approach was not successful for curved text. The bounding box created around 
the text (red color) by the existing method contains a non-text object. To avoid this, the bounding boxes of 
individual characters need to be handled appropriately. This motivated us to focus on this issue, and we 
succeeded in obtaining an optimal box (red color) around the text (as illustrated in Figure 1) using domination 
in graph theory. This paper introduces a very simple concept of dominating set construction, which helps in 
creating a precise box around text of any orientation. The contributions of this paper are: i) a simple and 
effective dominating set (DS) based method is proposed, ii) the DS-based method is tested on bilingual text 
(English and Kannada), and iii) the DS-based method is evaluated on benchmark datasets, namely the 13th 
international conference on document analysis and recognition (ICDAR 2015), arbitrary text (ArT), 
CurvedText 80 (CUTE80), and multi-script robust reading competition (MRRC) datasets. The rest of the paper 
is organized as follows: research method, results and discussion, and finally the conclusion. 


2. RESEARCH METHOD 

In this section, basic terminology related to domination in graph theory is presented, followed by the 
newly designed algorithms. The image enhancement step reads a scene image, processes it with a 
frequency-domain filtering technique, and outputs an enhanced image. The candidate text detection step 
applies a region-based method to the enhanced image and outputs a bounding box for each character. The 
bounding boxes are passed to the dominating set construction and selection method to extract arbitrarily 
oriented text optimally. The dominating points of each bounding box are then selected depending on the 
orientation of the text. For perspective text, the pendent dominating set is chosen as the dominating set. 
For non-perspective text, isolate dominating sets are used as the dominating set. The flow diagram of 
the proposed method is shown in Figure 2. 


[Figure 2 diagram labels: input image; localized text (yellow color)] 


Figure 2. Flow diagram of the proposed method 


2.1. Terminologies and definitions for text extraction and recognition system 

Domination is a widely used concept in graph theory, with applications in psychology, computer 
science, nervous systems, artificial intelligence, coding theory, and decision-making theory. In a graph, the 
DS is a useful tool for analyzing problems in areas such as networking, pattern recognition, 
clustering, traffic planning, biological modeling, facility location, school bus routing, and the 
medical sciences. The following subsections define the terminology needed for the text extraction 
and recognition system. 


2.1.1. Definitions 

Dominating set (DS): For a given graph G=(V, E), a set $S \subseteq V$ is called a dominating set if 
every vertex v of V is either i) an element of S or ii) adjacent to at least one element of S. 

Induced subgraph: Consider a graph G=(V, E) and a subset $S \subseteq V$ of its vertices. The induced 
subgraph G[S] is the graph with vertex set S whose edge set consists of all edges of E that have both 
endpoints in S. 

Isolate dominating set (IDS): A set S is said to be an isolate dominating set if it satisfies two conditions: 
i) S is a dominating set and ii) the induced subgraph G[S] contains at least one isolated vertex [23]. An 
example is shown in Figure 3. 

Pendent dominating set (PDS): A set S is said to be a pendent dominating set if i) S is a dominating set 
and ii) the induced subgraph G[S] contains at least one pendent vertex. An example is shown in Figure 3. 

Maximally stable extremal region (MSER): It is one of the stable region detectors. Text present in scene 
images varies in size, orientation, and language. MSER is stable across i) different sizes, ii) different 
languages, and iii) different orientations. Motivated by these properties, we use MSER to detect text 
regions. MSER $Q_{i^*}$: “Let $Q_1, \ldots, Q_{i-1}, Q_i, \ldots$ be a sequence of nested extremal 
regions ($Q_i \subset Q_{i+1}$). Extremal region $Q_{i^*}$ is maximally stable if 
$q(i) = |Q_{i+\Delta} \setminus Q_{i-\Delta}| / |Q_i|$ has a local minimum at $i^*$. 
Here $|\cdot|$ symbol represents cardinality of the set. $\Delta \in S$ is parameter of the method” [16]. 
$Q_{i^*}$ can be computed using (1). 


$$Q_{i^*}: \quad i^* = \arg\min_i \frac{|Q_{i+\Delta} \setminus Q_{i-\Delta}|}{|Q_i|} \qquad (1)$$ 
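
As a small numeric illustration of (1), the following sketch evaluates the stability ratio q(i) on a list of region areas. It relies on the fact that, for nested regions, the cardinality of the set difference equals the difference of the areas. The function names `stability` and `most_stable_index` and the area values are hypothetical and serve only as an example; this is not the MSER detector itself.

```python
def stability(areas, delta):
    """Return {i: q(i)} for every index i where i-delta and i+delta exist.

    areas[i] is |Q_i|, the pixel count of the i-th region of a nested
    sequence (Q_i is contained in Q_{i+1}). Because of the nesting, the
    size of the set difference between Q_{i+delta} and Q_{i-delta} is
    simply areas[i + delta] - areas[i - delta].
    """
    return {
        i: (areas[i + delta] - areas[i - delta]) / areas[i]
        for i in range(delta, len(areas) - delta)
    }


def most_stable_index(areas, delta):
    """Index i* that minimises q(i) over the valid range, as in (1)."""
    q = stability(areas, delta)
    return min(q, key=q.get)


# Hypothetical areas: the region grows quickly, plateaus, then grows again.
areas = [100, 180, 200, 204, 208, 300, 520]
print(most_stable_index(areas, delta=1))  # index in the middle of the plateau
```

The plateau in the area sequence is what makes a region "maximally stable": its size barely changes over a range of thresholds, so q(i) is small there.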


Not all regions obtained by the MSER detector are text. Some regions are text and 
some are non-text because of the complexities of scene images. To eliminate non-text regions, two-level 
filtering is applied, based on the geometric properties of the text and on stroke width variance. Finally, a 
bounding box is created for each potential candidate text character. These character bounding boxes are 
combined to form text. Before constructing the dominating set, the following rules are applied to finalize the 
bounding boxes. If a bounding box completely overlaps another, we ignore the inner bounding box and keep the 
outer one for further computation. If a bounding box partially overlaps another and their size ratio is 
1:0.5 or smaller, the two are replaced by a single bounding box with coordinates 
(Xmin, Ymin) and (Xmax, Ymax) taken over both boxes. These finalized bounding boxes are the input to the 
dominating set construction algorithm. 
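
The two finalization rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are axis-aligned tuples `(xmin, ymin, xmax, ymax)`, "size" is read as box area, and the 1:0.5 rule is interpreted as "the smaller area is at most half of the larger area". The helper names and the `max_ratio` parameter are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies completely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a: Box, b: Box) -> Box:
    """Smallest box covering both: (Xmin, Ymin) and (Xmax, Ymax)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def finalize_boxes(boxes: List[Box], max_ratio: float = 0.5) -> List[Box]:
    """Apply both rules repeatedly until no pair of boxes changes."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if contains(a, b):        # rule 1: keep the outer box
                    merged = a
                elif contains(b, a):
                    merged = b
                elif (overlaps(a, b)      # rule 2 (assumed reading of 1:0.5)
                      and min(area(a), area(b)) <= max_ratio * max(area(a), area(b))):
                    merged = union(a, b)
                else:
                    continue
                del boxes[j]              # j > i, so index i stays valid
                boxes[i] = merged
                changed = True
                break
            if changed:
                break
    return boxes


example = [(0, 0, 10, 10), (2, 2, 5, 5), (8, 0, 12, 4), (20, 0, 30, 10)]
print(finalize_boxes(example))
```

In this example the second box is dropped because it lies inside the first, the third is merged with the first because it partially overlaps it and is much smaller, and the last box is left untouched.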
[Figure 3 content: graph G(V, E) with vertices V1, V2, V3, V4; the vertices of each dominating set are marked in red on the graph] 

All possible dominating sets for the given graph G(V, E)    Type of the dominating set 
S1={V1, V2}                                                 Pendent dominating set 
S2={V2, V3}                                                 Pendent dominating set (used in Algorithm 1) 
S3={V3, V4}                                                 Pendent dominating set 
S4={V1, V4}                                                 Pendent dominating set (used in Algorithm 1) 
S5={V1, V3}                                                 Isolated dominating set (used in Algorithm 2) 
S6={V2, V4}                                                 Isolated dominating set (used in Algorithm 2) 


Figure 3. Illustrative example for dominating set construction for a given graph 
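
The classification in Figure 3 can be reproduced with a short sketch. It assumes that the example graph is the 4-cycle V1-V2-V3-V4-V1 (an assumption read off the figure) and uses hypothetical helper names. A dominating set is labeled "isolate" when its induced subgraph has a vertex of degree 0 and "pendent" when it has a vertex of degree 1; the isolate test is applied first, which is a choice made for this sketch.

```python
from itertools import combinations

# Assumption: the graph of Figure 3 is the 4-cycle V1-V2-V3-V4-V1.
VERTICES = {1, 2, 3, 4}
EDGES = {(1, 2), (2, 3), (3, 4), (4, 1)}


def neighbours(v, edges):
    """Vertices adjacent to v in an undirected edge list."""
    return {b for a, b in edges if a == v} | {a for a, b in edges if b == v}


def is_dominating(s, vertices, edges):
    """Every vertex is in s or adjacent to some element of s."""
    return all(v in s or neighbours(v, edges) & s for v in vertices)


def induced_degrees(s, edges):
    """Degree of each vertex of s inside the induced subgraph G[s]."""
    return {v: len(neighbours(v, edges) & s) for v in s}


def classify(s, vertices, edges):
    s = set(s)
    if not is_dominating(s, vertices, edges):
        return "not dominating"
    degrees = induced_degrees(s, edges).values()
    if 0 in degrees:
        return "isolate"      # G[s] has an isolated vertex
    if 1 in degrees:
        return "pendent"      # G[s] has a pendent vertex
    return "dominating"


for pair in combinations(sorted(VERTICES), 2):
    print(pair, classify(pair, VERTICES, EDGES))
```

On the 4-cycle, the four pairs of adjacent vertices come out as pendent dominating sets and the two pairs of opposite vertices as isolate dominating sets, in agreement with S1 to S6 in Figure 3.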


2.1.2. Algorithms 

This paper uses a dominating set-based method that constructs the dominating set from the character bounding 
boxes by selecting, for each box, one dominating point from the upper text line and one from the bottom text line. 
Therefore, each character has two dominating points. For example, if a text has n characters, then the 
dominating set consists of 2n+2 points (2n dominating points plus two corner points). All the selected 
dominating points are then connected clockwise to obtain a polygon-shaped box around the text. The algorithms 
for the proposed DS-based arbitrarily oriented bilingual text localization, for perspective and non-perspective 
text, are given below as Algorithm 1 and Algorithm 2 and are illustrated in Figures 4 and 5, respectively. 
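
The point count and the clockwise connection can be illustrated with a minimal sketch. The choice of which corner of each box serves as its dominating point depends on the text orientation in the proposed method; here, purely as an assumption, the top-left and bottom-left corners of every box are used and the two extra corner points are the right corners of the last box. The function name `text_polygon` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), y grows downward


def text_polygon(boxes: List[Box]) -> List[Point]:
    """Return 2n+2 polygon vertices in clockwise order (image coordinates)."""
    if not boxes:
        return []
    ordered = sorted(boxes, key=lambda b: b[0])      # left-to-right reading order
    top = [(b[0], b[1]) for b in ordered]            # one upper point per character
    bottom = [(b[0], b[3]) for b in ordered]         # one lower point per character
    last = ordered[-1]
    corners = [(last[2], last[1]), (last[2], last[3])]  # two closing corner points
    # Upper points left to right, down the right side, lower points right to left.
    return top + corners + bottom[::-1]


# Three hypothetical character boxes on a rising text line.
chars = [(0, 10, 4, 20), (5, 8, 9, 18), (10, 6, 14, 16)]
polygon = text_polygon(chars)
print(len(polygon), polygon)
```

For n = 3 characters the polygon has 2n+2 = 8 vertices; closing the path from the last vertex back to the first completes the box around the text.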


Algorithm 1. Proposed algorithm for perspective text 

Input: A set of bounding boxes of text 
BB = {BB_1(V_11, V_12, V_13, V_14), BB_2(V_21, V_22, V_23, V_24), ..., BB_n(V_n1, V_n2, V_n3, V_n4)}, where 
V_ij = (x_ij, y_ij), i = 1, 2, ..., n and j = 1, 2, ..., 4 

Output: Optimal arbitrary oriented text localization. 

1. For i = 1 to n-1 do 
2.    if (x coordinate increases and y_i1 and y_(i+1)1 coordinates increase and y_i4 and y_(i+1)4 decrease) 
3.       dominating set = right pendent dominating set (PDS) // shown in Figure 4: case 1 
4.    else if (x coordinate increases and y_i1 and y_(i+1)1 coordinates decrease and y_i4 and y_(i+1)4 increase) 
5.       dominating set = left pendent dominating set (PDS) // shown in Figure 4: case 2 
6.    else 
7.       apply Algorithm 2 
8. End For 
9. Connect all points: start from any point, move in the clockwise direction, and return to the starting 
   point. 
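
The branching in Algorithm 1 can be sketched for a pair of consecutive boxes as follows. This is an illustration under assumptions, not the authors' code: vertices 1 and 2 are taken to lie on the top edge and vertices 3 and 4 on the bottom edge (following the layout in Figure 4), "increases" and "decreases" are read as strict comparisons between box i and box i+1, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Dict[int, Point]  # vertex index (1..4) -> (x, y), with y growing downward


def make_box(xmin: float, ymin: float, xmax: float, ymax: float) -> Box:
    """Assumed layout: 1 = top-left, 2 = top-right, 3 = bottom-left, 4 = bottom-right."""
    return {1: (xmin, ymin), 2: (xmax, ymin), 3: (xmin, ymax), 4: (xmax, ymax)}


def classify_pair(cur: Box, nxt: Box) -> str:
    """Decide the dominating-set type for boxes i and i+1."""
    x_increases = nxt[1][0] > cur[1][0]
    top_dy = nxt[1][1] - cur[1][1]       # change of y_i1 -> y_(i+1)1
    bottom_dy = nxt[4][1] - cur[4][1]    # change of y_i4 -> y_(i+1)4
    if x_increases and top_dy > 0 and bottom_dy < 0:
        return "right PDS"               # Figure 4, case 1
    if x_increases and top_dy < 0 and bottom_dy > 0:
        return "left PDS"                # Figure 4, case 2
    return "non-perspective"             # hand over to Algorithm 2


def classify_text(boxes: List[Box]) -> List[str]:
    return [classify_pair(boxes[i], boxes[i + 1]) for i in range(len(boxes) - 1)]


# Hypothetical characters whose height shrinks from left to right.
shrinking = [make_box(0, 0, 10, 40), make_box(12, 5, 20, 35), make_box(22, 10, 28, 30)]
print(classify_text(shrinking))
```

Reversing the order of heights (characters growing from left to right) yields the left PDS branch, and boxes of equal height fall through to the non-perspective case.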


Algorithm 2. Proposed algorithm for non-perspective text 

Input: A set of bounding boxes of text 
BB = {BB_1(V_11, V_12, V_13, V_14), BB_2(V_21, V_22, V_23, V_24), ..., BB_n(V_n1, V_n2, V_n3, V_n4)}, where 
V_ij = (x_ij, y_ij), i = 1, 2, ..., n and j = 1, 2, ..., 4 

Output: Optimal arbitrary oriented text localization. 

1. For i = 1 to n-1 do 
2.    if (x and y coordinates are increasing (BB_i <= BB_(i+1))) // shown in Figure 5: case 4 
3.       dominating set = off-diagonal isolated dominating set (IDS) 
4.    else 
5.       dominating set = diagonal isolated dominating set (IDS) // shown in Figure 5: case 1, 2, 3 
6. End For 
7. Connect all points: start from any point, move in the clockwise direction, and return to the 
   starting point. 
[Figure 4 panels: Case 1, boxes BB1 to BB4 with dominating set = right PDS; Case 2, boxes BB1 to BB4 with 
dominating set = left PDS. BB_i (ith bounding box) is represented in the form of a graph with 4 vertices for the 
ith character in the text: V_i1=(x_i1, y_i1), V_i2=(x_i2, y_i2), V_i3=(x_i3, y_i3), V_i4=(x_i4, y_i4)] 


Figure 4. Illustration of perspective text 


[Figure 5 panels: Case 3 and Case 4, off-diagonal and diagonal text (line/curve), with the labels 
"Dominating set = diagonal IDP" and "Dominating set = off diagonal IDP"] 


Figure 5. Illustration of non-perspective text 


3. RESULTS AND DISCUSSION 

In this section, the proposed DS-based method is demonstrated on four benchmark datasets, and their 
challenges are described. Results are shown for randomly selected images, and a performance comparison with 
existing methods is given in the following subsections. 


3.1. Datasets 

We experimented with the proposed method on the publicly available ICDAR2015, ArT, and CUTE80 
datasets and on the regional-language MRRC dataset. Subsections 3.1.1 to 3.1.4 describe the datasets and 
their complexities. In Figure 6, panels (a) and (d) are input images randomly chosen from the datasets 
described in subsections 3.1.1 to 3.1.4, panels (b) and (e) are the corresponding output images without the 
DS-based algorithm, and panels (c) and (f) are the output images using the DS-based algorithm. 


3.1.1. ICDAR2015 

It includes 1,000 images for training and 500 images for testing. These images were taken with Google 
Glass and are poorly focused. Figures 6(c) and 6(f) in the third row show the experimental results of the 
proposed method on the ICDAR2015 dataset. 


3.1.2. ArT dataset 

Arbitrary text (ArT) provides 10,166 images, taken from the Total-Text, SCUT-CTW1500, and 
Baidu curved scene text datasets. The images are split into a training set of 5,603 images and a testing set 
of 4,563 images. Figures 6(c) and 6(f) in the first row show the experimental results of the proposed method 
on the ArT dataset. 


3.1.3. MRRC dataset 

This dataset provides 167 images for training and 167 images for testing. It 
includes text in different languages, such as Kannada, English, Hindi, and Chinese. For the experiment, we 
consider bilingual (Kannada and English) text, as it is one of the objectives of this paper. These 
images suffer from challenges such as interference factors during image acquisition, environmental effects, and 
diversified text. Figures 6(c) and 6(f) in the fourth row show the experimental results of the proposed method 
on the multi-script robust reading competition (MRRC) dataset. 


3.1.4. CUTE80 dataset 

CUTE80 is the first publicly available curved text dataset [8]. It consists of eighty curved text line 
images with challenges such as complex backgrounds and interference factors during image 
acquisition, for example perspective distortion and low resolution, and it contains circle-, S-, and Z-shaped 
text lines. The images are indoor or outdoor scenes taken with a digital camera or obtained from internet 
sources. Figures 6(c) and 6(f) in the second row show the experimental results of the proposed method on 
CUTE80. 


[Figure 6 layout: columns are dataset and text type, input image, output without DS-based algorithm, and 
output with DS-based algorithm (the last three repeated for a second image); rows are ArT (horizontal and 
curved text), CUTE80 (perspective and curved text), ICDAR2015 (perspective and vertical text), and MRRC 
(bilingual, horizontal and curved text)] 


Figure 6. Localization results of the proposed method (a) input images taken from various datasets, (b) output 
without DS based algorithm, (c) output with DS based algorithm, (d) input image, (e) output without DS 
based algorithm, and (f) output with DS based algorithm 


3.2. Performance analysis 

Performance of the proposed algorithm is measured with precision, recall, and F-score metrics. To 
evaluate localization performance, precision, recall, and F-score computation are carried out using (2), (3), 
and (4) respectively. 


$$R = \text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$ 

$$P = \text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$ 

$$F = F\text{-measure} = \frac{2 \times R \times P}{R + P} \qquad (4)$$ 


Here, TP is the number of texts detected correctly, FN is the number of texts not detected as text, and FP is 
the number of non-texts detected as text. The proposed dominating set-based method for text localization is 
evaluated on the MRRC dataset to show that it is independent of both orientation and language. Compared with 
the robust text detection technique [24] and the LOG-based structural arbitrary method [16], the proposed 
method increases the true positive count by reducing false positives. Precision, recall, and F-score values are 
given in Table 1. 
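
The three metrics in (2), (3), and (4) can be computed as in the following sketch. The function names and the detection counts are hypothetical and chosen only to illustrate the arithmetic; the final line applies (4) to the precision and recall reported for the proposed method on ICDAR2015.

```python
def recall(tp: int, fn: int) -> float:
    """Equation (2): fraction of true texts that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp: int, fp: int) -> float:
    """Equation (3): fraction of detections that are true texts."""
    return tp / (tp + fp) if tp + fp else 0.0


def f_measure(p: float, r: float) -> float:
    """Equation (4): harmonic mean of precision and recall."""
    return 2 * r * p / (r + p) if r + p else 0.0


# Hypothetical counts for one test set.
tp, fp, fn = 80, 20, 40
p = precision(tp, fp)   # 80 / 100
r = recall(tp, fn)      # 80 / 120
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))

# Equation (4) applied to percentages: P = 54.6, R = 64.3 gives about 59.1.
print(round(f_measure(54.6, 64.3), 1))
```

Because (4) is scale-invariant, it can be applied to fractions or to percentages as long as both inputs use the same scale.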
Table 1. Performance comparison of DS-based method with existing methods on MRRC dataset 
(Kannada and English) 

Methods                   Precision (%)   Recall (%)   F-score (%) 
Basavaraju et al. [16]    69.5            81.8         75.1 
Yin et al. [24]           42.0            64.0         51.0 
Proposed method           81.4            78.7         78.6 


The proposed DS-based method improves the recall and F-score in comparison with the robust text 
detection technique [24], the cascaded method [25], color prior guided MSER [26], and multi-oriented text 
detection with a fully convolutional network (FCN) [27]. However, its precision is lower than that of the 
method in [27]; this is due to the illumination problem in the ICDAR2015 dataset. Precision, recall, and 
F-score values are given in Table 2. 


Table 2. Performance comparison of proposed and existing methods on ICDAR2015 dataset 


Methods                   Precision (%)   Recall (%)   F-score (%) 
Yin et al. [24]           32.1            49.5         38.9 
Zheng et al. [25]         39.5            61.6         48.1 
Zhang et al. [26]         55.7            42.1         48.9 
Zhang et al. [27]         71.0            43.0         54.0 
Proposed method           54.6            64.3         59.1 


Figure 7 presents the input images (Figures 7(a) and 7(b)) and the text localization results of the existing 
methods (Figures 7(c) and 7(d)) and of the proposed method (Figures 7(e) and 7(f)) for arbitrarily oriented text. 
With the help of the graph-theoretic representation, the DS-based method creates a better boundary around 
irregularly shaped text, as shown in Figures 7(e) and 7(f), than the conventional methods shown in 
Figures 7(c) and 7(d). These results are encouraging in comparison with existing 
methods. 




Figure 7. Text localization results of the proposed and the existing methods for arbitrary oriented text, 
(a) and (b) input images, (c) result of [6], (d) result of [1], and (e) and (f) results of the proposed 
DS-based method 


4. CONCLUSION 

This paper explores a novel text localization method for arbitrarily oriented text in natural scene 
images. The dominating set construction developed in this work exploits graph theory to optimally extract 
text of different orientations and languages from scene images. The newly 
designed algorithm performed well on natural scene images that contain both English and Kannada 
text. The proposed algorithm has demonstrated promising results compared with popular benchmark 
techniques such as cascaded methods, color prior guided MSER, and multi-oriented text detection 
using FCNs. To increase recognition accuracy in the future, advanced preprocessing 
techniques such as noise removal, enhancement, and deblurring can be used to improve image quality. The 
proposed DS-based method can also be applied to live video data in real time to assess the algorithm’s 
consistency and effectiveness during the development of intelligent systems. Furthermore, domination is of 
practical interest and is well suited to analyzing real-time problems from any field, provided the problem 
can be represented in terms of graphs. 